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ABSTRACT 

A  general  framework  based  on  Gaussian  models  and  a  MAP- 
EM  algorithm  is  introduced  in  this  paper  for  solving  ma¬ 
trix/table  completion  problems.  The  numerical  experiments 
with  the  standard  and  challenging  movie  ratings  data  show 
that  the  proposed  approach,  based  on  probably  one  of  the 
simplest  probabilistic  models,  leads  to  the  results  in  the  same 
ballpark  as  the  state-of-the-art,  at  a  lower  computational  cost. 

Index  Terms —  Matrix  completion,  inverse  problems, 
collaborative  filtering,  Gaussian  mixture  models,  MAP  esti¬ 
mation,  EM  algorithm 

1.  INTRODUCTION 

Matrix  completion  amounts  to  estimate  all  the  entries  in  a  ma¬ 
trix  F  S  from  a  partial  and  potentially  noisy  observa¬ 

tion 

Y  =  H*F-bW,  (1) 

where  H  G  is  a  binary  matrix  with  1/0  entries  masking 

a  portion  of  F  through  the  element-wise  multiplication  •,  and 
W  G  an  additive  noise.  With  matrix  rows,  columns, 

and  entry  values  assigned  to  various  attributes,  matrix  com¬ 
pletion  may  have  numerous  applications.  For  example,  when 
the  rows  and  the  columns  of  F  are  attributed  to  users  and  items 
such  as  movies  and  books,  and  an  entry  at  position  (m,n) 
records  a  score  given  by  the  m-th  user  to  the  n-th  item,  matrix 
completion  predicts  users’  scores  on  the  items  they  have  not 
yet  rated,  based  on  the  available  scores  recorded  in  Y,  so  that 
personalized  item  recommendation  system  becomes  possible. 
This  is  a  classical  problem  in  collaborative  filtering. 

To  solve  an  ill-posed  matrix  completion  problem,  one 
must  rely  on  some  prior  information,  or  in  other  words,  some 
data  model.  The  most  popular  family  of  approaches  in  the 
literature  assumes  that  the  matrix  F  follows  approximately  a 
low  rank  model,  and  calculates  the  matrix  completion  with 
a  matrix  factorization  umii.  Theoretical  results  regarding 
the  completion  of  low-rank  matrices  have  been  recently  ob¬ 
tained  as  well,  e.g.,  Ham  and  references  therein.  More 
elaborative  probabilistic  models  and  some  refinement  have 
been  further  studied  on  top  of  matrix  factorization,  leading  to 
state-of-the-art  results  Eiiiiini  • 

In  image  processing,  assuming  that  local  image  patches 
follow  Gaussian  mixture  models  (GMM),  Yu,  Sapiro  and 


Mallat  have  recently  reported  excellent  results  in  a  number  of 
inverse  problems  M-  In  particular,  for  inpainting,  which  is 
an  analogue  to  matrix  completion  where  the  data  is  an  image, 
the  maximum  a  posteriori  expectation-maximization  (MAP- 
EM)  algorithm  for  solving  the  GMM  leads  to  state-of-the-art 
results,  with  a  computational  complexity  considerably  re¬ 
duced  with  respect  to  the  previous  state-of-the-art  approaches 
based  on  sparse  models  El,  (which  are  analogous  to  the 
low-rank  model  assumption  for  matrices). 

In  this  paper,  we  investigate  Gaussian  modeling  (a  par¬ 
ticular  case  of  GMM  with  only  a  single  Gaussian  distribu¬ 
tion)  for  matrix  completion.  Subparts  of  the  matrix,  typically 
rows  or  columns,  are  regarded  as  a  collection  of  signals  that 
are  assumed  to  follow  a  Gaussian  distribution.  An  efficient 
MAP-EM  algorithm  is  introduced  to  estimate  both  the  Gaus¬ 
sian  parameters  (mean  and  covariance)  and  the  signals.  We 
show  through  numerical  experiments  that  the  fast  MAP-EM 
algorithm,  based  on  the  Gaussian  model  which  is  the  simplest 
probabilistic  model  one  can  imagine,  leads  to  results  in  the 
same  ballpark  as  the  state-of-the-art  in  movie  rating  predic¬ 
tion,  at  significantly  lower  computational  cost.  Recent  theo¬ 
retical  results  ifTSll  further  support  the  consideration  of  Gaus¬ 
sian  models  for  the  recovery  of  missing  data. 

Section  |2]  introduces  the  Gaussian  model  and  the  MAP- 
EM  algorithm.  After  presenting  the  numerical  experiments  in 
Sectionj^  Section|^concludes  with  some  discussions. 


2.  MODEL  AND  ALGORITHM 
2.1.  Gaussian  Model 

Similar  to  the  local  patch  decomposition  often  applied  in  im¬ 
age  processing  m,  let  us  consider  each  subpart  of  the  matrix 
F  G  the  /-th  row  f,-  G  for  example,  as  a  signal.  Let 

h,  G  denote  the  /-th  row  in  the  binary  matrix  H,  and  let 
A,-  =  I  { y  |hi  (j)  ^0,1  <  j  <  N}  I  be  the  number  of  non-zero  en¬ 
tries  in  h,-.  Let  U,  G  denote  a  masking  operator  which 

maps  from  'MF  to  extracting  entries  of  f,  corresponding  to 
the  non-zero  entries  of  h,,  i.e.,  all  but  the  (Idx(h,,k))-th  the 
entries  in  the  k-th  row  of  U,  are  zero,  with  (Idx(h;,k))  the  in¬ 
dex  of  the  k-th  non-zero  entry  in  h,,  1  <  k  <  A,  .  Let  y,-  G 
and  w,-  G  be  respectively  the  sub-vector  of  the  /-th  row 
of  Y  and  W,  where  the  entries  of  h,  are  non-zero.  With  this 


notation,  we  can  rewrite  Q  in  a  more  general  linear  model 

y,- =  U;fi  +  w,-,  (2) 

for  all  the  signals  f,,  1  <  i  <  M.  Note  that  f,  can  also  be 
columns,  or  2D  sub-matrices  of  F  rendered  in  ID. 

The  Gaussian  model  assumes  that  each  signal  f,  is  drawn 
from  a  Gaussian  distribution,  with  a  probability  density  func¬ 
tion 

where  H  and  /r  are  the  unknown  covariance  and  mean  param¬ 
eters  of  the  Gaussian  distribution.  The  noise  w,  is  assumed  to 
follow  a  Gaussian  distribution  with  zero  mean  and  covariance 
Ew,  here  assumed  to  be  known  (or  calibrated). 

Estimating  the  matrix  F  from  the  partial  observation  Y 
can  thus  be  casted  into  the  following  problem: 

•  Estimate  the  Gaussian  parameters  from  the  ob¬ 

servation  {y,}i<KM- 

•  Estimate  f,-  from  y,-,  \  <i  <M,  using  the  Gaussian  dis¬ 
tribution  ^(/r,E). 

Since  this  is  a  non-convex  problem,  we  present  an  effi¬ 
cient  maximum  a  posteriori  expectation-maximization  (MAP- 
EM)  algorithm  that  calculates  a  local-minimum  solution  Ul 

[I6l. 


is  a  linear  estimator  and  is  optimal  in  the  sense  that  it  mini¬ 
mizes  the  mean  square  error  (MSE),  i.e.,  Ef.  = 

miUgEf.  —  g(y,)|j2],  as  well  as  the  mean  absolute  er¬ 
ror  (MAE),  i.e.,  £f,._„J||f;-fi||i]  =  ming£f.,„J||f,- -  g(y;)||  i], 
where  g  is  any  mapping  from  E).  The  second 

equality  of  Q  follows  from  the  Bayes  rule,  the  third  follows 
from  the  Gaussian  models  f,-  ^  and  w,-  ~ 

and  the  last  is  obtained  by  deriving  the  third  line  with  respect 
to  f,  . 

The  close-form  MAP  estimate  Q  can  be  calculated  fast. 
Observe  that  U,-  G  is  a  sparse  extraction  matrix,  each 

row  containing  only  one  non-zero  entry  with  value  1,  whose 
index  corresponds  to  the  non-zero  entry  in  h,.  Therefore, 
the  multiplication  operations  that  involve  U;  or  Uf  can  be 
realized  by  extracting  the  appropriate  rows  or  columns  at 
zero  computational  cost.  The  complexity  of  Ql  is  there¬ 
fore  dominated  by  the  matrix  inversion.  As  U,EUf  +  is 
positive-definite,  (U;£Uf  can  be  implemented  with 

Nf  /3  flops  through  a  Cholesky  factorization  111. 

In  a  typical  case  where  f,  is  the  /-th  row  of  the  matrix 
F,  I  <  i  <  M,  to  estimate  F  the  total  complexity  of  the  E- 
step  is  therefore  dominated  by  flops.  Eor  typical 

rating  prediction  datasets  that  are  highly  sparse,  among  a  large 
number  of  items  N,  most  users  have  rated  more  or  less  only  a 
small  number  A,  «  N/C  of  items,  where  C  is  large.  The  total 
complexity  of  the  E-step  is  thus  dominated  by  flops. 


2.2.  Algorithm 


Eollowing  a  simple  initialization,  addressed  in  Section  2.2.3 


the  MAP-EM  algorithm  is  an  iterative  procedure  that  alter¬ 
nates  between  two  steps: 


1.  E-step:  signal  estimation.  Assuming  the  estimates 
(/i,£)  are  known  (following  the  previous  M-step),  for 
each  i  one  computes  the  maximum  a  posteriori  (MAP) 
estimate  f;  of  f,-. 


2.  M-step:  model  estimation.  Assuming  the  signal  esti¬ 
mates  f,,  V/,  are  known  (following  the  previous  E-step), 
one  estimates  (updates)  the  Gaussian  model  parameters 

(A,£)- 


2.2.1.  E-step:  Signal  Estimation 

It  is  well  known  that  under  the  Gaussian  models  assumed 
in  Section  [TT]  the  MAP  estimate  that  maximizes  the  log  a- 
posteriori  probability 

f/  =  arg  max  log  p  (f;  I  y, •,/!,£) 

=  argmax  (logp(y;|f,)  -h  logp(f/|/i,£)) 

=  argm^in  -  y,)  +  (f,-  -  (T  -  A)) 

=  A+£uf(u,£uf+s„)-i(y,--u,'A),  (4) 

*  Writing  y,  in  the  reduced  dimension  W,  leads  to  a  calculation  in  dimen¬ 
sion  Ni  instead  of  N,  which  is  considerably  faster  if  A  ^  N. 


2.2.2.  M-step:  Model  Estimation 

The  parameters  of  the  model  are  estimated/updated  with  the 
maximum  likelihood  estimate. 


(A,£)  =  argmaxlogp({f,}i<KM|A>^)-  (5) 

With  the  Gaussian  model  f,  ~  cyE(/r,E),  it  is  well  known  that 
the  resulting  estimate  is  the  empirical  one, 

1  M  1  M 

A  £=  •  (6) 

i=l  ;=1 

The  empirical  covariance  estimate  may  be  improved 
through  regularization  when  there  is  lack  of  data  (let  us 
take  an  example  of  standard  rating  prediction,  where  there 
are  A  ~  10^  items  and  M  ~  10^  users,  the  dimension  of  the 
covariance  matrix  is  A  x  A  ~  10®).  A  simple  and  standard 
eigenvalue-based  regularization  is  used  here, 

£^£-l-eW,  (7) 

where  e  is  a  small  constant.  The  regularization  also  guaran¬ 
tees  that  the  estimate  £  of  the  covariance  matrix  is  full-rank, 
so  that  0  is  always  well  defined. 

To  estimate  F,  the  computational  complexity  of  the  M- 
step  is  dominated  by  the  calculation  of  the  empirical  covari¬ 
ance  estimate  requiring  MN^  flops,  which  is  negligible  with 
respect  to  the  E-step. 


As  the  MAP-EM  algorithm  iterates,  the  MAP  probability 
of  the  observed  signals  p(f,|y,,/l,£)  increases.  This  can  be 
observed  by  interpreting  the  E-  and  M-steps  as  a  coordinate 
descent  optimization  Q.  In  the  experiments,  the  algorithm 
converges  within  10  iterations. 

2.2.3.  Initialization 

The  MAP-EM  algorithm  is  initialized  with  an  initial  guess  of 
fj,  V/.  The  experiments  show  that  the  result  is  insensitive  to 
the  initialization  for  movie  rating  prediction.  In  the  numerical 
experiments,  all  the  unknown  entries  are  initialized  to  3  for 
datasets  containing  ratings  ranging  from  1  to  5  or  6. 

2.2.4.  Computational  and  Memory  Requirements 

In  a  typical  case  where  the  matrix  row  decomposition  in  (|^ 
is  considered,  the  overall  computational  complexity  of  the 
MAP-EM  to  estimate  an  M  x  A  matrix  is  dominated  by 
with  L  the  number  of  iterations  (typically  <  10) 
and  1/C  the  available  data  ratio,  with  C  typically  large  («  25 
for  the  standard  movie  ratings  data).  The  algorithm  is  thus 
very  fast.  As  each  row  f,  is  treated  as  a  signal  and  the  signals 
can  be  estimated  in  sequence,  the  memory  requirement  is 
dominated  by  (to  store  the  covariance  matrix  E). 

3.  NUMERICAL  EXPERIMENTS 

3.1.  Experimental  Protocols 

The  experimental  protocols  strictly  follow  those  described  in 
the  literature  |il|2llll[l0l[I2[Il[TIl.  The  proposed  method  is 
evaluated  on  two  movie  rating  prediction  benchmark  datasets, 
the  EachMovie  dataset  and  the  IM  MovieLens  dataset.  0 
Two  test  protocols,  the  so-called  “weak  generalization”  and 
“strong  generalization,”  ifTOll  are  applied. 

•  Weak  generalization  measures  the  ability  of  a  method 
to  generalize  to  other  items  rated  by  the  same  users  used 
for  training  the  method.  One  known  rating  is  randomly 
held  out  from  each  user’s  rating  set  to  form  a  test  set, 
the  rest  known  ratings  form  a  training  set.  The  method 
is  trained  using  the  data  in  the  training  set,  and  its  per¬ 
formance  is  evaluated  over  the  test  set. 

•  Strong  generalization  measures  the  ability  of  the 
method  to  generalize  to  some  items  rated  by  novel 
users  that  have  not  been  used  for  training.  The  set  of 
users  is  first  divided  into  training  users  and  test  users. 
Learning  is  performed  with  all  available  ratings  from 
the  training  set.  To  test  the  method,  the  ratings  of  each 
test  users  are  further  split  into  an  observed  set  and  a 
held-out  set.  The  method  is  shown  the  observed  ratings, 
and  is  evaluated  by  predicting  the  held-out  ratings. 

^http://www.grouplens.org/ 


The  EachMovie  dataset  contains  2.8  million  ratings  in  the 
range  { 1 , . . . ,  6}  for  1,648  movies  (columns)  and  74,424  users 
(rows).  Eollowing  the  standard  procedure  ifTOll.  users  with 
fewer  than  20  ratings  and  movies  with  less  than  2  ratings  are 
removed.  This  leaves  us  36,656  users,  1,621  movies,  and  2.5 
million  ratings  (available  data  ratio  4.3%).  We  randomly  se¬ 
lect  30,000  users  for  the  weak  generalization,  and  5,000  users 
for  the  strong  generalization.  The  IM  MovieLens  dataset  con¬ 
tains  1  million  ratings  in  the  range  { 1 , . . . ,  5  }  for  3,900  movies 
(columns)  and  6,040  users  (rows).  The  same  filtering  leaves 
us  6,040  users,  3,592  movies,  and  1  million  ratings  (available 
data  ratio  4.6%).  We  randomly  select  5,000  users  for  the  weak 
generalization,  and  1,000  users  for  the  strong  generalization. 
Each  experiment  is  run  3  times  and  the  average  reported. 

The  performance  of  the  method  is  measured  by  the  stan¬ 
dard  Normalized  Mean  Absolute  Error  (NMAE),  computed 
by  normalizing  the  mean  absolute  error  by  a  factor  for  which 
random  guessing  produces  a  score  of  1.  The  factor  is  1.944 
for  EachMovie,  and  1.6  for  MovieLens. 


In  contrast  to  most  exiting  algorithms  in  the  literature,  the  pro¬ 
posed  method,  thanks  to  its  simplicity,  enjoys  the  advantage 
of  having  very  few  intuitive  parameters.  The  covariance  regu¬ 
larization  parameter  e  in  Q  is  set  equal  to  0.3  (whose  square 
root  is  one  order  of  magnitude  smaller  than  the  maximum  rat¬ 
ing),  the  results  being  insensitive  to  this  value  as  shown  by 
the  experiments.  The  number  of  iterations  of  the  MAP-EM 
algorithm  is  fixed  at  L  =  10,  beyond  which  the  convergence 
of  the  algorithm  is  always  observed.  The  noise  w,  in  ^  is 
neglected,  i.e.,  is  set  to  0,  as  the  movie  rating  datasets 
mainly  involve  missing  data,  the  noise  being  implicit  and  as¬ 
sumed  negligible. 

The  experiments  show  that  considering  row  f,  G  of 
the  matrix  F  G  as  signals  leads  to  slightly  better  re¬ 

sults  than  taking  columns  or  2D  sub-matrices.  This  means 
that  each  user  is  a  signal,  whose  dimension  is  the  number  of 
movies. 

As  in  previous  works  El,  a  post-processing  that  projects 
the  estimated  rating  to  an  integer  within  the  rating  range  is 
performed. 

A  Matlab  code  of  the  proposed  algorithm  is  available  at 
http://www.cmap.polytechnique.fr/^yu/research/MC/demo.html 


The  results  of  the  proposed  method  are  compared  with  the 
best  published  ones  including  User  Rating  Profile  (URP)  M, 
Attitude  lim.  Maximum  Margin  Matrix  Factorization  (MMMF)  lfT4l. 
Ensemble  of  MMMF  (E-MMMF)  H,  Item  Proximity  based 
Collaborative  Filtering  (IPCF)  lfT2l.  Gaussian  Process  Latent 
Variable  Models  (GPLVM)  Q,  Mixed  Membership  Matrix 
Factorization  (M^F)  |0,  and  Nonparametric  Bayesian  Matrix 
Completion  (NBMC)  El-  For  each  of  these  methods,  more 
than  one  results  produced  with  different  configurations  are 


3.2.  Proposed  Method  Setup 


3.3.  Results 


often  reported,  among  which  we  systematically  cite  the  best 
one.  All  these  methods  are  significantly  more  complex  than 
the  one  here  proposed. 

Tables  [T]  and  presents  the  results  of  various  methods 
for  both  weak  and  strong  generalizations  on  the  two  datasets. 
NBMC  generates  most  often  the  best  results,  followed  closely 
by  the  proposed  method  referred  to  as  GM  (Gaussian  Model) 
and  GPLVM,  all  of  them  outperforming  the  other  methods. 
The  results  produced  by  the  proposed  GM,  with  a  by  far  sim¬ 
pler  model  and  faster  algorithm,  is  in  the  same  ballpark  as 
those  of  NBMC  and  GPLVM,  the  difference  with  NMAE  be¬ 
ing  smaller  than  about  0.005,  marginal  in  the  rating  range  that 
goes  from  1  to  5  or  6. 


EachMovie 

Weak  NMAE 

Strong  NMAE 

URP  Goi 

0.4422 

0.4557 

Attitude  mil 

0.4520 

0.4550 

MMMEn^ 

0.4397 

0.4341 

IPCE  Oil 

0.4382 

0.4365 

E-MMME  a 

0.4287 

0.4301 

GPLVM  Q 

0.4179 

0.4134 

M^Eil 

0.4293 

n/a 

NBMC  flTli 

0.4109 

0.4091 

GM  (proposed) 

0.4164 

0.4163 

Table  1.  NMAEs  generated  by  different  methods  for  Each- 
Movie  database.  The  smallest  NMAE  is  in  boldface. 


IM  MovieLens 

Weak  NMAE 

Strong  NMAE 

URPIfTOl 

0.4341 

0.4444 

Attitude  111  11 

0.4320 

0.4375 

mmmeOS 

0.4156 

0.4203 

IPCEIfTH 

0.4096 

0.4113 

E-MMME  11 

0.4029 

0.4071 

GPLVM  Q 

0.4026 

0.3994 

M^P® 

n/a 

n/a 

NBMC  llTTli 

0.3916 

0.3992 

GM  (proposed) 

0.3959 

0.3928 

Table  2.  NMAEs  generated  by  different  methods  for  IM 
MovieLens  database.  The  smallest  NMAE  is  in  boldface. 

4.  CONCLUSION  AND  DISCUSSION 

We  have  shown  that  a  Gaussian  model  and  a  MAP-EM  algo¬ 
rithm  provide  a  simple  and  computational  efficient  solution 
for  matrix  completion,  leading  to  results  in  the  same  ballpark 
as  state-of-the-art  ones  for  movie  rating  prediction. 

Euture  work  may  go  in  several  directions.  On  the  one 
hand,  the  proposed  conceptually  simple  and  computationally 
efficient  method  may  provide  a  good  baseline  for  further  re¬ 
finement,  for  example  by  incorporating  user  and  item  bias  or 
meta  information  ca.  On  the  other  hand,  Gaussian  mix¬ 
ture  models  (GMM)  that  have  been  shown  to  bring  dramatic 


improvements  over  single  Gaussian  models  in  image  inpaint¬ 
ing  M,  are  expected  to  better  capture  different  characteris¬ 
tics  of  various  categories  of  movies  (comedy,  action,  etc.)  and 
classes  of  users  (age,  gender,  etc.).  However,  no  significant 
improvement  by  GMM  over  Gaussian  model  has  yet  been  ob¬ 
served  for  movie  rating  prediction.  This  needs  to  be  further 
investigated,  and  such  improvement  might  come  from  proper 
grouping  and  initialization. 
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